
update benchmark data on VGG19 #5148

Merged: 2 commits merged into PaddlePaddle:develop on Nov 1, 2017

Conversation

tensor-tang (Contributor):

related #5008

Machine:

- Server
  - Intel(R) Xeon(R) Gold 6148M CPU @ 2.40GHz, 2 Sockets, 20 Cores per socket
Contributor:

With 2 sockets and 20 cores per socket, that would be 40 cores.
But cat /proc/cpuinfo shows 80 processors on my machine.

Contributor Author:

Right, what I wrote here is the number of physical cores. The 80 means hyperthreading is enabled on your machine.

- Laptop
  - DELL XPS15-9560-R1745: i7-7700HQ 8G 256GSSD
  - i5 MacBook Pro (Retina, 13-inch, Early 2015)
- Desktop
  - i7-6700k
Contributor:

The model information for the Laptop and Desktop entries here is incomplete; we could add a TODO.

Contributor Author:

Sure, no problem. That can be added later when you add the benchmark data for those machines.
The models I listed here are the ones you listed in issue #5008.


System: CentOS 7.3.1611
Contributor:

CentOS 6.3.10

Contributor Author:

Ah, the one I used is 7.3.

| BatchSize | 64    | 128   | 256   |
|-----------|-------|-------|-------|
| OpenBLAS  | 7.86  | 9.02  | 10.62 |
| MKLML     | 11.80 | 13.43 | 16.21 |
| MKL-DNN   | 29.07 | 30.40 | 31.06 |
luotao1 (Contributor), Oct 27, 2017:

The numbers I measured are 1.5-2x slower overall. OpenBLAS was built from source, while MKLML and MKL-DNN were both run with the docker image.

| BatchSize | 64   | 128  | 256        |
|-----------|------|------|------------|
| OpenBLAS  | 4    | 4.92 | not tested |
| MKLML     | 4.7  | 6.4  | 7.68       |
| MKL-DNN   | 20   | 20   | 21         |

tensor-tang (Contributor Author), Oct 27, 2017:

From what you just said, your system has hyperthreading enabled. In that case the configuration here should be export KMP_AFFINITY="granularity=fine,compact,1,0".

My script was measured with hyperthreading disabled.

Also, it is best to check with perf top while the benchmark is running, to see whether the MKL-DNN engines are all being used correctly.
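For reference, a minimal sketch of that setup and check as shell commands (the KMP_AFFINITY value is the one quoted above; exactly which symbols should dominate in perf top depends on the MKL-DNN build, so treat this as an assumption rather than an official procedure):

```bash
# With hyperthreading enabled, pin OpenMP threads to physical cores
# using the affinity setting suggested above.
export KMP_AFFINITY="granularity=fine,compact,1,0"

# While the benchmark is running, inspect the hottest symbols;
# MKL-DNN kernels should show up at the top if its engine is really in use.
perf top
```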

Contributor:

After changing it to export KMP_AFFINITY="granularity=fine,compact,1,0", the test results are still the same.

Contributor Author:

OK. May I ask what the BIOS version is? Also, are all the memory slots populated, and what is the memory frequency?

luotao1 (Contributor), Oct 27, 2017:

Here is the output of the dmidecode command:
dmidecode.log.txt

Contributor:

I don't think this is related to docker. MKLML and MKL-DNN both run in docker; taking the first column of data, my speedup is (20-4.7)/4.7 = 3.25x, while yours is (29.07-11.8)/11.8 = 1.45x.

Could the MKLML and MKL-DNN versions also be built locally? Measuring just one data point to see whether the numbers improve would be enough.

I can build the OpenBLAS version inside docker and test that.

Also, the missing-libc problem you mentioned last time when building locally still needs to be resolved.

It would be best to run the benchmark in docker, so that differences caused by different environments can be avoided.

tensor-tang (Contributor Author), Oct 30, 2017:

> MKLML and MKL-DNN both run in docker; taking the first column of data, my speedup is (20-4.7)/4.7 = 3.25x, while yours is (29.07-11.8)/11.8 = 1.45x.

That actually points to the problem more clearly. The MKLML-to-MKL-DNN ratio I measured outside docker is not that large, which suggests your MKLML number inside docker is on the low side, or that there is a potential issue that has not been found yet.

> I can build the OpenBLAS version inside docker and test that.

Yes, I agree with that. It is better to put all three in the same environment.

> It would be best to run the benchmark in docker, so that differences caused by different environments can be avoided.

I agree with that too. If possible, could you share your docker image with me? I would like to run it on my machine as well, to first rule out basic issues such as the machine configuration.

tensor-tang (Contributor Author), Oct 30, 2017:

Looking carefully at the dmidecode output, the machine's memory is indeed not in the optimal configuration for performance. 16 DIMMs are currently installed, and 8 of them are sharing 4 channels.
The DIMMs in CPU0_A1, CPU0_D1, CPU1_A1 and CPU1_D1 need to be removed.
If the slots on the board come in blue and black, that means removing all the DIMMs in the black slots.

luotao1 (Contributor), Oct 31, 2017:

Many thanks to @BlackZhengSQ from the systems department for helping us rearrange the DIMMs correctly.
With MKL-DNN and batchsize=64 the number is now 26.67, so memory clearly has a large impact on performance.
But there is still a gap between 26.67 and 28.46.

| BatchSize | 64    | 128   | 256   |
|-----------|-------|-------|-------|
| MKLML     | 10.95 | 12.81 | 15.21 |
| MKL-DNN   | 26.67 | 28.06 | 28.65 |

Contributor:

Many thanks to @BlackZhengSQ from the systems department for helping us upgrade from CentOS 4.3 to CentOS 6.3.
With MKL-DNN, the gap has now shrunk from the original 6% to 3%.

| BatchSize | 64    | 128  |
|-----------|-------|------|
| MKL-DNN   | 27.69 | 28.8 |

tensor-tang (Contributor Author):

I used the latest docker image paddlepaddle/paddle:latest and did a quick run of MKL-DNN with bs=64 on the 6148:

I1030 12:57:26.494154 40 Stat.cpp:102] ======= StatSet: [GlobalStatInfo] status ======
I1030 12:57:26.494226 40 Stat.cpp:105] Stat=FwdBwd TID=40 total=224885 avg=2248.85 max=2322.83 min=2235.7 count=100

The speed is 64/2.24885 = 28.46, which roughly matches the 29.07 I measured earlier.
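For reference, a minimal sketch of how that speed is derived from the FwdBwd log line above, assuming avg is milliseconds per mini-batch so the result is images per second:

```bash
# speed = batch_size / (avg_ms / 1000); here avg = 2248.85 ms at batch size 64
awk 'BEGIN { printf "%.2f\n", 64 / (2248.85 / 1000) }'   # prints 28.46
```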

tensor-tang (Contributor Author):

It looks like the CentOS version does have some impact; the numbers in the commit were measured on bare metal with version 7.2.

For the MKL-DNN numbers, I ran again inside docker 1.12.6:
batchsize 64

I1101 08:12:39.871258    35 Stat.cpp:105] Stat=FwdBwd                         TID=35     total=223947     avg=2239.47    max=2306.18    min=2225.71    count=100

batchsize 128

I1101 08:20:38.068055   137 Stat.cpp:105] Stat=FwdBwd                         TID=137    total=422396     avg=4223.96    max=4285.01    min=4200.91    count=100

batchsize 256

I1101 08:36:04.149484   239 Stat.cpp:105] Stat=FwdBwd                         TID=239    total=829780     avg=8297.8     max=8360.78    min=8259.83    count=100

Summarized:

| BatchSize          | 64    | 128   | 256   |
|--------------------|-------|-------|-------|
| with docker (A)    | 28.58 | 30.30 | 30.85 |
| without docker (B) | 29.07 | 30.40 | 31.06 |
| differ = (B-A)/A   | 1.72% | 0.32% | 0.68% |

The largest difference is 1.7%, so running inside or outside docker makes essentially no difference.

Compared with the numbers on CentOS 6.3, the difference is around 3%.
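A small sketch of how the table above can be reproduced, assuming the with-docker throughput A is batch_size / (avg_ms / 1000) from the three FwdBwd logs, compared against the earlier bare-metal numbers B:

```bash
awk 'BEGIN {
  # avg_ms from the three docker runs above; B from the earlier bare-metal table
  bs[1] = 64;  avg[1] = 2239.47; b[1] = 29.07
  bs[2] = 128; avg[2] = 4223.96; b[2] = 30.40
  bs[3] = 256; avg[3] = 8297.80; b[3] = 31.06
  for (i = 1; i <= 3; i++) {
    a = bs[i] / (avg[i] / 1000)                  # with-docker throughput A
    printf "bs=%d  A=%.2f  differ=%.2f%%\n", bs[i], a, 100 * (b[i] - a) / a
  }
}'
# bs=64   A=28.58  differ=1.72%
# bs=128  A=30.30  differ=0.32%
# bs=256  A=30.85  differ=0.68%
```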

luotao1 (Contributor) commented Nov 1, 2017:

The docker version also has little impact on performance. For the MKL-DNN numbers with batchsize=64, I ran inside docker 1.6.0 and 1.13.1:

docker 1.13.1:

I1101 08:44:03.134213 39 Stat.cpp:105] Stat=FwdBwd TID=39 total=232649 avg=2326.49 max=2522.99 min=2301.4 count=100

docker 1.6.0:

I1101 08:03:47.377549 34 Stat.cpp:105] Stat=FwdBwd TID=34 total=231180 avg=2311.8 max=2405.7 min=2298.31 count=100

All the numbers earlier in this conversation were measured under docker 1.6.0.

tensor-tang (Contributor Author) commented Nov 1, 2017:

I compared this with the data that @luotao1 updated.

1. Comparing the absolute values, the differences cluster around 5%:

   0.51% 4.64% 2.71%
   7.08% 4.43% 5.74%
   4.98% 5.56% 6.12%

2. Comparing relative values, using my earlier data:

   | BatchSize        | 64   | 128  | 256  |
   |------------------|------|------|------|
   | MKLML / OpenBLAS | 1.50 | 1.49 | 1.53 |
   | MKL-DNN / MKLML  | 2.46 | 2.26 | 1.92 |

   The numbers on CentOS 6.3:

   | BatchSize        | 64   | 128  | 256  |
   |------------------|------|------|------|
   | MKLML / OpenBLAS | 1.41 | 1.49 | 1.48 |
   | MKL-DNN / MKLML  | 2.51 | 2.24 | 1.91 |

   The differences are:

   6.53% -0.20% 2.95%
   -1.96% 1.08% 0.35%

Looking at the relative differences, only the first number is somewhat large; the MKL-DNN/MKLML ratios all look fine.

tensor-tang merged commit a343504 into PaddlePaddle:develop on Nov 1, 2017.
tensor-tang deleted the benchmark branch on Nov 1, 2017 (14:42).
luotao1 mentioned this pull request on Nov 27, 2017.